[Spark] Make delta.dataSkippingStatsColumns more lenient for nested columns #2850
base: master
Conversation
@kamcheungting-db since you added this originally
@longvu-db since you're reviewing my other PR 😊. This would be a great quality of life improvement for complex nested schemas.
@Kimahriman Will take a look!
Commit: …csCollection.scala (Co-authored-by: Thang Long Vu <[email protected]>)
@@ -599,6 +599,27 @@ trait DataSkippingDeltaTestsBase extends DeltaExcludedBySparkVersionTestMixinShi
    deltaStatsColNamesOpt = Some("b.c")
  )

  testSkipping(
Should we consider a double struct test?
Added a double nested struct as well
  },
  "i": 10
}""".replace("\n", ""),
hits = Seq(
Could we have hit tests for elements inside the struct and the double struct as well?
Btw, click on re-request review so that I get notified.
Added more hit tests
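For reference, a rough sketch of what one of those hit tests could look like, inferred from the surrounding diff context (the exact testSkipping signature and the misses parameter are assumptions on my part):

```scala
// Sketch only, inferred from the diff context above; not the PR's exact test.
// Assumes testSkipping(name, data, hits, misses, deltaStatsColNamesOpt).
testSkipping(
  "data skipping stats columns - doubly nested struct",
  """{
    "b": { "c": { "d": 5 }, "e": [1, 2, 3] },
    "i": 10
  }""".replace("\n", ""),
  hits = Seq("b.c.d < 10"),   // could match: min(b.c.d) = 5 < 10
  misses = Seq("b.c.d > 10"), // skipped: max(b.c.d) = 5
  deltaStatsColNamesOpt = Some("b")
)
```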
My understanding is that when we are inside a struct, we gather stats for the valid data types inside the struct, but for the invalid ones we still don't, right? Would we want to test that we still don't skip on the invalid data types inside the struct?
Also, if we could check another skipping-eligible type besides integer, like timestamp or string, that would be great.
Technically null counts are collected, but they can't be used for skipping. We could add that to the test, but it seems kind of out of scope because you can't do min/max things with arrays/maps anyway, so it'd have to be something like an element-contains check.
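For what it's worth, one way to eyeball which stats actually get collected (a rough illustration; the table path is hypothetical, and stats live as a JSON string on each AddFile action in the transaction log):

```scala
// Rough illustration: inspect per-file stats recorded in the first
// transaction log entry of a hypothetical Delta table.
val logEntry = spark.read.json("/tmp/delta/nested_table/_delta_log/00000000000000000000.json")
logEntry
  .select("add.stats")
  .where("add IS NOT NULL")
  .show(truncate = false)
// For a struct mixing an INT leaf and an ARRAY leaf, minValues/maxValues
// contain only the INT leaf, while nullCount covers both leaves.
```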
I think there are other non-skipping-eligible data types besides list types.
I'm not sure how to specify a Binary type in the inferred JSON. Is that the only non-complex type?
@Kimahriman Just out of curiosity, why are you making this change? Do you have some complex nested schemas that you want to skip data on?
Yes, we have structs with mixed arrays and primitive types. Currently I have to recursively count the fields to set the number of indexed columns to the right value.
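To illustrate the pain point (a hypothetical schema, not taken from the PR):

```scala
// Hypothetical example: a struct mixing skipping-eligible primitives with
// an ineligible array type.
spark.sql("""
  CREATE TABLE events (
    payload STRUCT<id: BIGINT, ts: TIMESTAMP, tags: ARRAY<STRING>>,
    extra STRING
  ) USING delta
""")

// The workaround today: recursively count leaf fields so that the first N
// indexed columns cover everything eligible inside the struct.
spark.sql(
  "ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingNumIndexedCols' = '4')")
```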
}
case _ if insideStruct => columnPaths.append(name)
When we are inside the struct, we keep appending columns regardless of the data type, right?
So is the description of "@param columnPaths" no longer correct?
This seems outdated as well:
delta/spark/src/main/scala/org/apache/spark/sql/delta/stats/StatisticsCollection.scala (line 481 in da5a5d2):
 * 2. Delta statistics column must exist in delta table's schema.
Technically it is "valid"; it will actually collect null counts on all fields regardless of whether min/max is supported.
@@ -458,15 +458,20 @@ object StatisticsCollection extends DeltaCommand {
 * @param name The name of the data skipping column for validating data type.
 * @param dataType The data type of the data skipping column.
 * @param columnPaths The column paths of all valid fields.
 * @param insideStruct Whether the datatype is inside a struct already, in which case we don't
Instead of saying "we don't clear", we can make it clearer what we mean by that, I think.
Changed, not sure if it's any more clear.
@@ -608,7 +590,8 @@ class StatsCollectionSuite

    Seq(
      "BIGINT", "DATE", "DECIMAL(3, 2)", "DOUBLE", "FLOAT", "INT", "SMALLINT", "STRING",
-     "TIMESTAMP", "TIMESTAMP_NTZ", "TINYINT"
+     "TIMESTAMP", "TIMESTAMP_NTZ", "TINYINT", "STRUCT<c3: BIGINT>",
Looks like there are relevant tests in DataSkippingDeltaTests as well:
delta/spark/src/test/scala/org/apache/spark/sql/delta/stats/DataSkippingDeltaTests.scala (line 442 in da5a5d2):
  "b.c.d < 0",
Those tests are for specifying the number of indexed columns, not columns by name
Hi Adam, I intentionally throw an error when the column doesn't support delta stats.
If we want to swallow that error, there should be a way to tell the user that the column is unsupported.
Happy to log a warning, but it's not actually unsupported. Null counts are still collected for all columns. If users are confused why their arrays and maps aren't collecting mins and maxes, they have bigger problems. This behavior matches how the number of indexed columns works.
Warning sounds like a good indicator.
Added a warning log; let me know if you have thoughts on the exact verbiage, I just copied the error text.
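For context, a minimal self-contained sketch of the lenient validation being discussed (simplified stand-ins: the eligible-type list is abbreviated, and println stands in for the real logger; this is not the PR's exact code):

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.types._

// Simplified sketch: inside a struct, an ineligible leaf type is skipped
// with a warning; a directly named ineligible column still throws.
def validateStatsColumn(
    name: String,
    dataType: DataType,
    columnPaths: ArrayBuffer[String],
    insideStruct: Boolean = false): Unit = dataType match {
  case s: StructType =>
    // Recurse into every field, remembering that we are now inside a struct.
    s.fields.foreach { f =>
      validateStatsColumn(s"$name.${f.name}", f.dataType, columnPaths, insideStruct = true)
    }
  case IntegerType | LongType | StringType | DateType | TimestampType =>
    // Stand-in for SkippingEligibleDataType: record the eligible leaf path.
    columnPaths.append(name)
  case d if insideStruct =>
    // Lenient case: warn instead of failing. Null counts are still
    // collected for these fields; they just get no min/max stats.
    println(s"WARN: data skipping is not supported for column $name of type $d")
  case d =>
    // Strict case: the user explicitly named an ineligible column.
    throw new IllegalArgumentException(
      s"Data skipping is not supported for column $name of type $d")
}
```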
}
case SkippingEligibleDataType(_) => columnPaths.append(name)
case d if insideStruct =>
Should we also make the non-struct fields have the same behavior?
Up to you. The idea was: if you directly specify an ineligible type, throw an exception because you did something wrong. If you specify a struct, just work with the subfields that are supported.
Then I am good with it.
Thank you.
Which Delta project/connector is this regarding?
Spark
Description
Resolves #2822
Make delta.dataSkippingStatsColumns more lenient for nested columns by not throwing an exception if a nested column doesn't support gathering stats. This more closely matches the behavior of dataSkippingNumIndexedCols, which allows unsupported types in those columns (and seems to still gather null counts for those unsupported types). This also allows more use cases where you might have a wide variety of types inside a top-level struct and you simply want to gather stats on whatever columns inside that struct you can.

I kept the duplicate column checking in place to minimize changes, but I'm not sure how necessary that really is besides letting users know they are doing something dumb.
How was this patch tested?
A couple of tests that specifically checked for the old exception-throwing behavior were removed, and a new test was added to verify the new behavior works.
Does this PR introduce any user-facing changes?
Yes, specifying a struct with unsupported stats-gathering types in delta.dataSkippingStatsColumns is now allowed instead of throwing an exception.
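For example (hypothetical table and column names):

```scala
// Before this PR, naming a struct that contains an ineligible nested type
// (e.g. an array) threw an exception; now the eligible subfields get stats
// and the ineligible ones are skipped with a warning.
spark.sql(
  "ALTER TABLE events SET TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'payload')")
```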